Cross-Fertilizing Deep Web Analysis and Ontology Enrichment

Authors

  • Marilena Oita
  • Antoine Amarilli
  • Pierre Senellart
Abstract

Deep Web databases, whose content is presented as dynamically generated Web pages hidden behind forms, have mostly been left unindexed by search engine crawlers. In order to automatically explore this mass of information, many current techniques assume the existence of domain knowledge, which is costly to create and maintain. In this article, we present a new perspective on form understanding and deep Web data acquisition that does not require any domain-specific knowledge. Unlike previous approaches, we do not perform the various steps in the process (e.g., form understanding, record identification, attribute labeling) independently but integrate them to achieve a more complete understanding of deep Web sources. Through information extraction techniques and using the form itself for validation, we reconcile input and output schemas in a labeled graph which is further aligned with a generic ontology. The impact of this alignment is threefold: first, the resulting semantic infrastructure associated with the form can assist Web crawlers when probing the form for content indexing; second, attributes of response pages are labeled by matching known ontology instances, and relations between attributes are uncovered; and third, we enrich the generic ontology with facts from the deep Web.

1. ONTOLOGIES AND THE DEEP WEB

The deep Web consists of dynamically generated Web pages that are reachable by issuing queries through HTML forms. A form is a section of a document with special control elements (e.g., checkboxes, text inputs) and associated labels. Users generally interact with a form by modifying its controls (entering text, selecting menu items) before submitting it to a Web server for processing.
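To make the notion of a form's controls and labels concrete, here is a minimal sketch that reads an HTML form and lists its controls with their associated labels, using only Python's standard `html.parser`. The class `FormSchemaParser`, the function `input_schema`, and the sample form are hypothetical names introduced for illustration; this is not the extraction method of the paper, and it assumes that labels are linked to controls through `for`/`id` attributes, which many real forms do not do.

```python
from html.parser import HTMLParser


class FormSchemaParser(HTMLParser):
    """Collects the labels and controls of an HTML form (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.labels = {}         # control id -> label text
        self.controls = []       # one dict per user-facing control
        self._label_for = None   # id targeted by the <label> being read
        self._label_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self._label_for = attrs.get("for")
            self._label_text = []
        elif tag in ("input", "select", "textarea"):
            kind = attrs.get("type", "text") if tag == "input" else tag
            # Assumption: submit, hidden and button inputs carry no query attribute.
            if kind in ("submit", "hidden", "button"):
                return
            self.controls.append(
                {"id": attrs.get("id"), "name": attrs.get("name"), "type": kind}
            )

    def handle_data(self, data):
        if self._label_for is not None:
            self._label_text.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self._label_for is not None:
            # Collapse runs of whitespace inside the label text.
            text = " ".join("".join(self._label_text).split())
            self.labels[self._label_for] = text
            self._label_for = None


def input_schema(html):
    """Return (label, control name, control type) triples for the form."""
    parser = FormSchemaParser()
    parser.feed(html)
    return [
        (parser.labels.get(c["id"], ""), c["name"], c["type"])
        for c in parser.controls
    ]


# Hypothetical form, not taken from any real deep Web source.
SAMPLE_FORM = """
<form action="/search" method="get">
  <label for="t">Title</label> <input type="text" id="t" name="title">
  <label for="a">Author</label> <input type="text" id="a" name="author">
  <label for="c">Category</label>
  <select id="c" name="cat"><option>Fiction</option><option>Science</option></select>
  <input type="submit" value="Search">
</form>
"""

if __name__ == "__main__":
    for triple in input_schema(SAMPLE_FORM):
        print(triple)
```

The resulting list of labeled controls is one simple way to picture a form's input schema; forms that place labels by visual proximity rather than by `for`/`id` links would need layout-based heuristics instead.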
Forms are primarily designed for human beings, but they must also be understood by automated agents for various applications such as general-purpose indexing of response pages, focused indexing [13], extensional crawling strategies (e.g., Web archiving), automatic construction of ontologies [29], etc. However, most existing approaches to automatically explore and classify the deep Web crucially rely on domain knowledge [10, 12, 30] to guide form understanding. Moreover, they tend to separate the steps of form interface understanding and information extraction from result pages, although both contribute [27] to a more accurate view of the backend database schema. The form interface exposes in the input schema some attributes describing the query object, while response pages present this object instantiated in Web records that outline the form output schema.

In this paper, we determine a mapping between the input and output schemas which associates the data types corresponding to form elements in the input schema to instances aligned in the output schema. A harder challenge is to understand the semantics of these data types and how they relate to the object of the form. The input–output schema mapping may give us hints, such as the input schema labels, but this information cannot suffice by itself. This has been addressed in related work using heuristics [26] or assumed domain knowledge [19] which is either manually crafted or obtained by merging different form interface schemas belonging to the same domain. Domain knowledge is, however, not only hard to build and maintain, but also often restricted to a choice of popular domain topics, which may lead to biased exploration of the deep Web.

We present a new way to deal with this challenge: we initially probe the form in a domain-agnostic manner and transform the information extracted from response pages into a labeled graph. This graph is then aligned with a general-domain ontology, YAGO [23], using the PARIS ontology alignment system [22]. This allows us to infer the semantics of the deep Web source, to obtain new, representative query terms from YAGO for the probing of form fields, and to possibly enrich YAGO with new facts.

VLDS'12, August 31, 2012, Istanbul, Turkey. Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
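The idea of turning response-page records into a labeled graph and labeling its attributes against known ontology instances can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the records, the small `KNOWN_INSTANCES` dictionary standing in for an ontology such as YAGO, the class names, and the majority-vote labeling. It is not the PARIS alignment algorithm, which is probabilistic and considerably more involved.

```python
from collections import Counter, defaultdict


def records_to_graph(records):
    """Turn extracted records into (record node, attribute edge, literal) triples."""
    triples = []
    for i, record in enumerate(records):
        for j, value in enumerate(record):
            # Attributes are still unnamed at this stage: attr0, attr1, ...
            triples.append((f"record{i}", f"attr{j}", value))
    return triples


def label_attributes(triples, known_instances):
    """Label each attribute edge with the class most often matched by its values."""
    votes = defaultdict(Counter)
    for _, edge, value in triples:
        cls = known_instances.get(value)
        if cls is not None:
            votes[edge][cls] += 1
    return {edge: counter.most_common(1)[0][0] for edge, counter in votes.items()}


def candidate_facts(triples, labels, known_instances):
    """Values under a labeled attribute that the ontology does not know yet."""
    return [
        (value, "type", labels[edge])
        for _, edge, value in triples
        if edge in labels and value not in known_instances
    ]


# Hypothetical records, as if extracted from the response pages of a book form.
RECORDS = [
    ("Hamlet", "William Shakespeare"),
    ("Emma", "Jane Austen"),
    ("Sanditon", "Jane Austen"),
]

# Hypothetical stand-in for ontology instances and their classes.
KNOWN_INSTANCES = {
    "Hamlet": "work",
    "Emma": "work",
    "William Shakespeare": "writer",
    "Jane Austen": "writer",
}

TRIPLES = records_to_graph(RECORDS)
LABELS = label_attributes(TRIPLES, KNOWN_INSTANCES)
NEW_FACTS = candidate_facts(TRIPLES, LABELS, KNOWN_INSTANCES)

if __name__ == "__main__":
    print(LABELS)
    print(NEW_FACTS)
```

In this toy setting, values that fall under a labeled attribute but are missing from the dictionary play the role of candidate facts for enriching the ontology, and known instances of a labeled class could serve as further query terms for probing the matching form field.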

Similar resources

Systematic enrichment analysis of microRNA expression profiling studies in endometriosis

Objective(s): The purpose of this study was to conduct a meta-analysis on human microRNAs (miRNAs) expression data of endometriosis tissue profiles versus those of normal controls and to identify novel putative diagnostic markers. Materials and Methods: PubMed, Embase, Web of Science, Ovid Medline were used to search for miRNA expression profiling studies of endometriosis. The miRN...

agriGO: a GO analysis toolkit for the agricultural community

Gene Ontology (GO), the de facto standard in gene functionality description, is used widely in functional annotation and enrichment analysis. Here, we introduce agriGO, an integrated web-based GO analysis toolkit for the agricultural community, using the advantages of our previous GO enrichment tool (EasyGO), to meet analysis demands from new technologies and research objectives. EasyGO is valu...

Deeper: A Data Enrichment System Powered by Deep Web

Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment works are resource intensive: data-intensive by relying on web tables or knowledge bases, monetarily-intensive by purchasing entire datasets, or timeint...

GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis

Gene Ontology (GO) analysis has become a commonly used approach for functional studies of large-scale genomic or transcriptomic data. Although much software with GO-related analysis functions already exists, new tools are still needed to meet the requirements for data generated by newly developed technologies or for advanced analysis purposes. Here, we present a Gene Ontology Enrichment Ana...

CPSS: a computational platform for the analysis of small RNA deep sequencing data

Next generation sequencing (NGS) techniques have been widely used to document the small ribonucleic acids (RNAs) implicated in a variety of biological, physiological and pathological processes. An integrated computational tool is needed for handling and analysing the enormous datasets from small RNA deep sequencing approach. Herein, we present a novel web server, CPSS (a computationa...

Publication date: 2012